fix(node): close under-withholding via full ref scope and reachable-only pin set#46
fix(node): close under-withholding via full ref scope and reachable-only pin set#46beardthelion wants to merge 4 commits into
Conversation
|
Warning Review limit reached
More reviews will be available in 49 minutes and 37 seconds. Learn how PR review limits work. To continue reviewing without waiting, enable usage-based billing in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (9)
📝 WalkthroughWalkthroughThis PR implements end-to-end encryption and withholding for Phase 3 confidentiality: withheld blobs are sealed to recipient keys via ephemeral X25519 envelopes, encrypted with XChaCha20Poly1305, pinned to IPFS with opaque recipient tags, anchored to Arweave, replicated across mirrors using promisor-based filtering, and recovered locally via a new ChangesCore encryption and visibility foundations
Server-side filtered pack serving and encrypted blob management
Mirror mode classification and encrypted blob replication
Client-side clone command with recovery paths
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (4)
crates/gl/src/clone.rs (1)
262-267: ⚡ Quick winParse
/encrypted-blobsinto a typed response instead ofValue.A missing or mistyped
blobsfield currently degrades to[]and silently disables node-side recovery. Making this schema-strict would catch server/client contract drift the same wayWithheldPathsResponsealready does.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gl/src/clone.rs` around lines 262 - 267, Replace the loose serde_json::Value parsing for resp in clone.rs with a strongly typed EncryptedBlobsResponse struct so schema mismatches surface as errors: define EncryptedBlobsResponse { blobs: Vec<EncryptedBlobOrString> } (matching the server shape, analogous to WithheldPathsResponse), then call resp.json::<EncryptedBlobsResponse>().await.context("parsing encrypted-blobs")? and use the returned .blobs directly instead of body.get(...).and_then(...).cloned().unwrap_or_default(); this ensures missing or mistyped "blobs" no longer silently becomes [] and will fail fast (update the EncryptedBlobOrString type to the actual blob element type used).crates/gitlawb-node/src/api/encrypted.rs (1)
109-123: ⚡ Quick winAdd a handler test for the 404/404 access-control contract.
The new test only locks the wire shape. A small handler-level test that exercises both “caller denied” and “OID missing” on
get_encrypted_blobwould protect the deliberate non-oracle behavior from future refactors.Based on learnings, this endpoint should keep “not found” and “not authorized” externally indistinguishable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gitlawb-node/src/api/encrypted.rs` around lines 109 - 123, Add a handler-level test for the access-control contract of get_encrypted_blob that verifies the external response is indistinguishable when the caller is denied versus when the OID is missing: write two tests (or one parameterized test) invoking the get_encrypted_blob handler — one scenario where the auth check fails (simulate "caller denied") and one where the blob store returns NotFound (simulate "OID missing") — and assert both responses have the same HTTP status and body (e.g., both return 404/not-found) to ensure the endpoint does not leak authorization state; reference the get_encrypted_blob handler in your test and reuse existing test scaffolding in the encrypted.rs tests module.Source: Learnings
crates/gitlawb-core/src/encrypt.rs (1)
66-88: ⚡ Quick winZeroize transient key material in
seal_blobandopen_blob.
content_key, the seed-derived X25519 secret, and the unwrapped content key all live in plain stack/heap buffers until drop. In this confidentiality path that is avoidable exposure; keep them inZeroizing(or explicitly callzeroize()) once the ciphers are initialized.Hardening sketch
+use zeroize::Zeroizing; + - let mut content_key = [0u8; 32]; - OsRng.fill_bytes(&mut content_key); + let mut content_key = Zeroizing::new([0u8; 32]); + OsRng.fill_bytes(&mut *content_key); let body_cipher = XChaCha20Poly1305::new_from_slice(&content_key) .map_err(|e| anyhow::anyhow!("content key: {e}"))?; ... - let my_x = XSecret::from(x25519_secret_from_seed(&keypair.seed_bytes())); + let seed = keypair.to_seed(); + let my_x = XSecret::from(x25519_secret_from_seed(&seed)); ... - let content_key = content_key.context("not a recipient of this envelope")?; + let content_key = Zeroizing::new(content_key.context("not a recipient of this envelope")?);Also applies to: 126-167
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gitlawb-core/src/encrypt.rs` around lines 66 - 88, In seal_blob and open_blob, sensitive key material (the 32-byte content_key, the ephemeral XSecret `eph`, the seed-derived X25519 secret, and any unwrapped content key buffers) must be zeroed after use: wrap stack/heap buffers in zeroize::Zeroizing (e.g., Zeroizing<[u8;32]>, Zeroizing<Vec<u8>>) or explicitly call zeroize() on types that implement Zeroize immediately after the ciphers are initialized and after encryption/decryption completes; ensure you zero `content_key`, the ephemeral secret `eph` (or call its zeroize method), the seed-derived X25519 secret produced by x25519_private/related variable, and the unwrapped content key before returning or pushing into Recipient so no plaintext key remains in memory.crates/gitlawb-node/src/git/visibility_pack.rs (1)
101-130: 🏗️ Heavy liftAvoid re-traversing the repo to derive recipients.
withheld_blob_recipients()does one fullblob_paths()sweep throughwithheld_blob_oids()and then a second fullblob_paths()sweep for the recipient pass. On the push path,git_receive_pack()has already computed the withheld set before it gets here, so repos with visibility rules now pay up to three completels-treetraversals per push. Threading the precomputed withheld/blob-path data into this API would keep recipient derivation single-pass.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gitlawb-node/src/git/visibility_pack.rs` around lines 101 - 130, withheld_blob_recipients currently re-traverses the repo by calling withheld_blob_oids(...) and then blob_paths(...), causing multiple ls-tree sweeps; change withheld_blob_recipients signature to accept precomputed withheld data (e.g., Option<&HashSet<String>> or Option<&HashMap<String,String>> representing withheld oids or oid->path map) so callers like git_receive_pack can pass the already-computed set/map, skip the internal withheld_blob_oids call when provided, and iterate the provided oid->path mapping (or filter blob_paths once) to build recipients; update callers (notably git_receive_pack) to supply the precomputed data and keep behavior identical when None is passed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gitlawb-core/src/identity.rs`:
- Around line 55-59: The new seed_bytes() method leaks a non-zeroizing copy of
the Ed25519 seed; replace its usage with the existing to_seed() which returns a
Zeroizing<[u8;32]>, or change seed_bytes() to return a zeroizing wrapper (e.g.
Zeroizing<[u8;32]>) instead of plain [u8;32]; update callers such as the code
paths that call seed_bytes() (including open_blob and the encrypted-pin task in
crates/gitlawb-node/src/api/repos.rs) to accept and use the Zeroizing seed or to
call to_seed() directly so the raw seed is never held in plain memory.
In `@crates/gitlawb-node/src/api/repos.rs`:
- Around line 398-425: Short-circuit the expensive withheld walk by detecting
when the repo's rules cannot withhold any path-scoped blobs (e.g., rules is
empty or only applies to root) before calling
visibility_pack::withheld_blob_oids; if that check shows no possible withheld
blobs, skip the tokio::task::spawn_blocking(...) and call
smart_http::upload_pack(&disk_path, body).await directly, otherwise proceed to
spawn_blocking and then choose between smart_http::upload_pack and
smart_http::upload_pack_excluding based on the resulting withheld set; use the
existing variables and functions (rules, visibility_pack::withheld_blob_oids,
withheld, smart_http::upload_pack, smart_http::upload_pack_excluding) to
implement the early return.
In `@crates/gitlawb-node/src/encrypted_pin.rs`:
- Around line 77-80: Replace silent `continue` branches when
`crate::git::store::read_object(repo_path, oid)` and the IPFS pin operation fail
with warning logs that include contextual details (repo_path, oid, and the
error/result) so failures are observable; in the match arms around `let data =
match crate::git::store::read_object(repo_path, oid) { ... }` and the subsequent
IPFS pin match, call the module’s logger (e.g., process_logger.warn or
log::warn) with a concise message and the error/result, then continue as before
to preserve behavior.
- Around line 72-76: The current code builds `keys: Vec<VerifyingKey>` by
`filter_map(|d| did_to_key(d))`, which silently drops unresolved DIDs and
proceeds to encrypt to a partial recipient set; change this so any unresolved
DID causes the blob to be skipped: instead of `filter_map`, resolve all DIDs and
detect failures (e.g., collect Options or compare `keys.len()` to `dids.len()`
after mapping via `did_to_key`), and when a mismatch is found log a warning
referencing `oid` (similar to the existing log) and `continue` so you do not
seal to a partial set; ensure you keep the `VerifyingKey` handling consistent
with downstream code that expects `keys`.
In `@crates/gitlawb-node/src/ipfs_pin.rs`:
- Around line 75-85: The cat function creates a reqwest::Client with no timeout
(reqwest::Client::new()) which can hang get_encrypted_blob and the sync worker;
modify cat (or callers) to use a client with a bounded timeout (e.g., build a
Client via reqwest::Client::builder().timeout(Duration::from_secs(...)) and
reuse it) or apply a per-request timeout (e.g., wrap the send/bytes await with
tokio::time::timeout) so the POST to {}/api/v0/cat?arg={} fails fast on
slow/stalled Kubo; update the cat signature or module to accept/reuse a Client
or enforce a timeout around resp = client.post(&url).send().await and
resp.bytes().await to avoid indefinite hangs.
In `@crates/gitlawb-node/src/sync.rs`:
- Around line 181-188: The code currently treats a None result from
fetch_withheld as MirrorMode::Plain which incorrectly downgrades unknown/error
states; instead, check the Option returned by fetch_withheld and only set
MirrorMode::Plain when you get Some(empty); if fetch_withheld returns None
preserve the repository's existing promisor/partialclone state (do not call
classify_mirror on None). Concretely, change the branch around
fetch_withheld/classify_mirror to: if withheld.is_some() { let mode =
classify_mirror(withheld); } else { read the current repo promisor/partialclone
setting from the local repo at local_path (or otherwise keep the current
MirrorMode) and use that mode for fetch_repo/clone_repo }, so
fetch_repo/clone_repo are invoked with the preserved mode rather than
unconditionally using MirrorMode::Plain.
In `@crates/gl/src/clone.rs`:
- Around line 428-459: The current GraphQL fetch (the local `query` +
`gql_body`, use of `client.post(...).json(&gql_body).send().await`, and
subsequent `parse_tx_ids(&gql)`) only requests first:100 and stops, causing
missing manifests; change the logic to paginate: rewrite the GraphQL query to
take variables for "first" and "after" (cursor), then loop POSTing the query
(building `gql_body` with `{ "repo": slug, "first": batch_size, "after": cursor
}`), call `parse_tx_ids` (or adjust it to return edges + pageInfo) to collect tx
ids across pages, update `cursor` from pageInfo.nextCursor/hasNextPage until no
more pages, then proceed to fetch manifests and call
`merge_manifests(manifests)` as before; ensure the loop uses the same
`client.post(...).json(&gql_body).send().await` error handling as existing code
and retains the existing behavior when responses fail.
- Around line 415-416: The slug normalization and Arweave discovery/parsing need
tightening: ensure owner short-form uses the same split logic as the node by
keeping owner_short = owner.split(':').next_back().unwrap_or(owner) and forming
slug = format!("{owner_short}/{name}"); update the GraphQL discovery for
encrypted-manifest anchors in discover/recovery code to paginate instead of
hard-coding first:100 (iterate queries until no more edges or timestamps
covered) so recover_encrypted_blobs sees all anchors; change
recover_encrypted_blobs to deserialize the /encrypted-blobs response into a
strict typed struct (like the existing withheld-paths type) and fail on
missing/wrong fields instead of using Value with unwrap_or_default(), ensuring
schema-strict parsing and surfaced errors.
---
Nitpick comments:
In `@crates/gitlawb-core/src/encrypt.rs`:
- Around line 66-88: In seal_blob and open_blob, sensitive key material (the
32-byte content_key, the ephemeral XSecret `eph`, the seed-derived X25519
secret, and any unwrapped content key buffers) must be zeroed after use: wrap
stack/heap buffers in zeroize::Zeroizing (e.g., Zeroizing<[u8;32]>,
Zeroizing<Vec<u8>>) or explicitly call zeroize() on types that implement Zeroize
immediately after the ciphers are initialized and after encryption/decryption
completes; ensure you zero `content_key`, the ephemeral secret `eph` (or call
its zeroize method), the seed-derived X25519 secret produced by
x25519_private/related variable, and the unwrapped content key before returning
or pushing into Recipient so no plaintext key remains in memory.
In `@crates/gitlawb-node/src/api/encrypted.rs`:
- Around line 109-123: Add a handler-level test for the access-control contract
of get_encrypted_blob that verifies the external response is indistinguishable
when the caller is denied versus when the OID is missing: write two tests (or
one parameterized test) invoking the get_encrypted_blob handler — one scenario
where the auth check fails (simulate "caller denied") and one where the blob
store returns NotFound (simulate "OID missing") — and assert both responses have
the same HTTP status and body (e.g., both return 404/not-found) to ensure the
endpoint does not leak authorization state; reference the get_encrypted_blob
handler in your test and reuse existing test scaffolding in the encrypted.rs
tests module.
In `@crates/gitlawb-node/src/git/visibility_pack.rs`:
- Around line 101-130: withheld_blob_recipients currently re-traverses the repo
by calling withheld_blob_oids(...) and then blob_paths(...), causing multiple
ls-tree sweeps; change withheld_blob_recipients signature to accept precomputed
withheld data (e.g., Option<&HashSet<String>> or Option<&HashMap<String,String>>
representing withheld oids or oid->path map) so callers like git_receive_pack
can pass the already-computed set/map, skip the internal withheld_blob_oids call
when provided, and iterate the provided oid->path mapping (or filter blob_paths
once) to build recipients; update callers (notably git_receive_pack) to supply
the precomputed data and keep behavior identical when None is passed.
In `@crates/gl/src/clone.rs`:
- Around line 262-267: Replace the loose serde_json::Value parsing for resp in
clone.rs with a strongly typed EncryptedBlobsResponse struct so schema
mismatches surface as errors: define EncryptedBlobsResponse { blobs:
Vec<EncryptedBlobOrString> } (matching the server shape, analogous to
WithheldPathsResponse), then call
resp.json::<EncryptedBlobsResponse>().await.context("parsing encrypted-blobs")?
and use the returned .blobs directly instead of
body.get(...).and_then(...).cloned().unwrap_or_default(); this ensures missing
or mistyped "blobs" no longer silently becomes [] and will fail fast (update the
EncryptedBlobOrString type to the actual blob element type used).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: fab81d81-dc64-46fe-b486-709131c8a108
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (23)
crates/gitlawb-core/Cargo.tomlcrates/gitlawb-core/src/encrypt.rscrates/gitlawb-core/src/identity.rscrates/gitlawb-core/src/lib.rscrates/gitlawb-node/Cargo.tomlcrates/gitlawb-node/src/api/encrypted.rscrates/gitlawb-node/src/api/mod.rscrates/gitlawb-node/src/api/repos.rscrates/gitlawb-node/src/api/visibility.rscrates/gitlawb-node/src/arweave.rscrates/gitlawb-node/src/db/mod.rscrates/gitlawb-node/src/encrypted_pin.rscrates/gitlawb-node/src/git/mod.rscrates/gitlawb-node/src/git/smart_http.rscrates/gitlawb-node/src/git/visibility_pack.rscrates/gitlawb-node/src/ipfs_pin.rscrates/gitlawb-node/src/main.rscrates/gitlawb-node/src/pinata.rscrates/gitlawb-node/src/server.rscrates/gitlawb-node/src/sync.rscrates/gitlawb-node/src/visibility.rscrates/gl/src/clone.rscrates/gl/src/main.rs
| /// The raw 32-byte Ed25519 seed. Used to derive the X25519 secret for | ||
| /// envelope decryption (see `crate::encrypt`). | ||
| pub fn seed_bytes(&self) -> [u8; 32] { | ||
| self.signing_key.to_bytes() | ||
| } |
There was a problem hiding this comment.
Avoid adding a non-zeroizing raw-seed accessor.
to_seed() already exposes this material behind Zeroizing; seed_bytes() reintroduces the same long-lived signing seed as a plain [u8; 32]. That copy now flows into open_blob and the async encrypted-pin path, which widens private-key lifetime without adding any new capability. Prefer reusing to_seed() here, or make this helper return a zeroizing wrapper instead of raw bytes. This already shows up in crates/gitlawb-node/src/api/repos.rs:659-707, where the seed is moved into the spawned encrypted-pin task.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-core/src/identity.rs` around lines 55 - 59, The new
seed_bytes() method leaks a non-zeroizing copy of the Ed25519 seed; replace its
usage with the existing to_seed() which returns a Zeroizing<[u8;32]>, or change
seed_bytes() to return a zeroizing wrapper (e.g. Zeroizing<[u8;32]>) instead of
plain [u8;32]; update callers such as the code paths that call seed_bytes()
(including open_blob and the encrypted-pin task in
crates/gitlawb-node/src/api/repos.rs) to accept and use the Zeroizing seed or to
call to_seed() directly so the raw seed is never held in plain memory.
| // withheld_blob_oids walks every ref with blocking `git ls-tree`; keep that | ||
| // off the async worker thread. | ||
| let withheld = { | ||
| let path = disk_path.clone(); | ||
| let rules = rules.clone(); | ||
| let owner_did = record.owner_did.clone(); | ||
| let caller_owned = caller.map(str::to_string); | ||
| let is_public = record.is_public; | ||
| tokio::task::spawn_blocking(move || { | ||
| visibility_pack::withheld_blob_oids( | ||
| &path, | ||
| &rules, | ||
| is_public, | ||
| &owner_did, | ||
| caller_owned.as_deref(), | ||
| ) | ||
| }) | ||
| .await | ||
| .map_err(|e| { | ||
| let msg = e.to_string(); | ||
| if msg.contains("bad line length") || msg.contains("protocol error") { | ||
| tracing::warn!(repo = %name, err = %msg, "git-upload-pack: bad client request"); | ||
| AppError::BadRequest(msg) | ||
| } else { | ||
| tracing::error!(repo = %name, err = %msg, "git-upload-pack failed"); | ||
| AppError::Git(msg) | ||
| } | ||
| })?; | ||
| .map_err(|e| AppError::Git(e.to_string()))? | ||
| .map_err(|e| AppError::Git(e.to_string()))? | ||
| }; | ||
|
|
||
| let resp = if withheld.is_empty() { | ||
| smart_http::upload_pack(&disk_path, body).await | ||
| } else { | ||
| tracing::info!(repo = %name, caller = ?caller, withheld = withheld.len(), "serving filtered pack"); | ||
| smart_http::upload_pack_excluding(&disk_path, body, &withheld).await | ||
| } |
There was a problem hiding this comment.
Skip the withheld walk when no path-scoped rule can withhold anything.
After the "/" gate above, repos whose rule set is empty or root-only cannot produce a non-empty withheld blob set, but this branch still runs the full withheld_blob_oids() ref walk before falling back to the ordinary upload_pack(). That adds an extra full repo scan to every clone/fetch on the common public path.
💡 Minimal short-circuit
- let withheld = {
- let path = disk_path.clone();
- let rules = rules.clone();
- let owner_did = record.owner_did.clone();
- let caller_owned = caller.map(str::to_string);
- let is_public = record.is_public;
- tokio::task::spawn_blocking(move || {
- visibility_pack::withheld_blob_oids(
- &path,
- &rules,
- is_public,
- &owner_did,
- caller_owned.as_deref(),
- )
- })
- .await
- .map_err(|e| AppError::Git(e.to_string()))?
- .map_err(|e| AppError::Git(e.to_string()))?
- };
-
- let resp = if withheld.is_empty() {
+ let has_path_scoped_rules = rules.iter().any(|r| r.path_glob != "/");
+ let resp = if !has_path_scoped_rules {
smart_http::upload_pack(&disk_path, body).await
} else {
- tracing::info!(repo = %name, caller = ?caller, withheld = withheld.len(), "serving filtered pack");
- smart_http::upload_pack_excluding(&disk_path, body, &withheld).await
+ let withheld = {
+ let path = disk_path.clone();
+ let rules = rules.clone();
+ let owner_did = record.owner_did.clone();
+ let caller_owned = caller.map(str::to_string);
+ let is_public = record.is_public;
+ tokio::task::spawn_blocking(move || {
+ visibility_pack::withheld_blob_oids(
+ &path,
+ &rules,
+ is_public,
+ &owner_did,
+ caller_owned.as_deref(),
+ )
+ })
+ .await
+ .map_err(|e| AppError::Git(e.to_string()))?
+ .map_err(|e| AppError::Git(e.to_string()))?
+ };
+ if withheld.is_empty() {
+ smart_http::upload_pack(&disk_path, body).await
+ } else {
+ tracing::info!(repo = %name, caller = ?caller, withheld = withheld.len(), "serving filtered pack");
+ smart_http::upload_pack_excluding(&disk_path, body, &withheld).await
+ }
}
.map_err(|e| {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/api/repos.rs` around lines 398 - 425, Short-circuit
the expensive withheld walk by detecting when the repo's rules cannot withhold
any path-scoped blobs (e.g., rules is empty or only applies to root) before
calling visibility_pack::withheld_blob_oids; if that check shows no possible
withheld blobs, skip the tokio::task::spawn_blocking(...) and call
smart_http::upload_pack(&disk_path, body).await directly, otherwise proceed to
spawn_blocking and then choose between smart_http::upload_pack and
smart_http::upload_pack_excluding based on the resulting withheld set; use the
existing variables and functions (rules, visibility_pack::withheld_blob_oids,
withheld, smart_http::upload_pack, smart_http::upload_pack_excluding) to
implement the early return.
| let keys: Vec<VerifyingKey> = dids.iter().filter_map(|d| did_to_key(d)).collect(); | ||
| if keys.is_empty() { | ||
| tracing::warn!(oid = %oid, "no resolvable recipient keys; skipping encrypted pin"); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Do not encrypt to a partial recipient set when some DIDs fail key resolution.
At Line 72, filter_map drops unresolved DIDs and still seals for the remaining keys. That can make authorized readers unrecoverable even though the blob is recorded as successfully encrypted/pinned. Treat any unresolved DID as a per-blob failure and skip instead of partially sealing.
💡 Suggested fix
- let keys: Vec<VerifyingKey> = dids.iter().filter_map(|d| did_to_key(d)).collect();
- if keys.is_empty() {
- tracing::warn!(oid = %oid, "no resolvable recipient keys; skipping encrypted pin");
+ let mut keys = Vec::with_capacity(dids.len());
+ let mut unresolved = Vec::new();
+ for did in dids {
+ match did_to_key(did) {
+ Some(k) => keys.push(k),
+ None => unresolved.push(did.clone()),
+ }
+ }
+ if !unresolved.is_empty() {
+ tracing::warn!(oid = %oid, unresolved = ?unresolved, "some recipient DIDs could not be resolved; skipping encrypted pin");
+ continue;
+ }
+ if keys.is_empty() {
+ tracing::warn!(oid = %oid, "empty recipient set; skipping encrypted pin");
continue;
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/encrypted_pin.rs` around lines 72 - 76, The current
code builds `keys: Vec<VerifyingKey>` by `filter_map(|d| did_to_key(d))`, which
silently drops unresolved DIDs and proceeds to encrypt to a partial recipient
set; change this so any unresolved DID causes the blob to be skipped: instead of
`filter_map`, resolve all DIDs and detect failures (e.g., collect Options or
compare `keys.len()` to `dids.len()` after mapping via `did_to_key`), and when a
mismatch is found log a warning referencing `oid` (similar to the existing log)
and `continue` so you do not seal to a partial set; ensure you keep the
`VerifyingKey` handling consistent with downstream code that expects `keys`.
| let data = match crate::git::store::read_object(repo_path, oid) { | ||
| Ok(Some((_t, bytes))) => bytes, | ||
| _ => continue, | ||
| }; |
There was a problem hiding this comment.
Add warnings for object-read and IPFS pin failures instead of silent skips.
At Line 77 and Line 88, failures currently continue without logs. This makes production retry/debugging hard and weakens observability during partial outages.
💡 Suggested fix
- let data = match crate::git::store::read_object(repo_path, oid) {
- Ok(Some((_t, bytes))) => bytes,
- _ => continue,
- };
+ let data = match crate::git::store::read_object(repo_path, oid) {
+ Ok(Some((_t, bytes))) => bytes,
+ Ok(None) => {
+ tracing::warn!(oid = %oid, "blob object not found locally; skipping encrypted pin");
+ continue;
+ }
+ Err(e) => {
+ tracing::warn!(oid = %oid, err = %e, "failed reading blob object; skipping encrypted pin");
+ continue;
+ }
+ };
@@
- let cid = match crate::ipfs_pin::pin_git_object(ipfs_api, oid, &envelope).await {
- Ok(c) if !c.is_empty() => c,
- _ => continue,
- };
+ let cid = match crate::ipfs_pin::pin_git_object(ipfs_api, oid, &envelope).await {
+ Ok(c) if !c.is_empty() => c,
+ Ok(_) => {
+ tracing::warn!(oid = %oid, "pin_git_object returned empty CID; skipping record");
+ continue;
+ }
+ Err(e) => {
+ tracing::warn!(oid = %oid, err = %e, "pin_git_object failed; skipping encrypted pin");
+ continue;
+ }
+ };Also applies to: 88-90
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/encrypted_pin.rs` around lines 77 - 80, Replace
silent `continue` branches when `crate::git::store::read_object(repo_path, oid)`
and the IPFS pin operation fail with warning logs that include contextual
details (repo_path, oid, and the error/result) so failures are observable; in
the match arms around `let data = match
crate::git::store::read_object(repo_path, oid) { ... }` and the subsequent IPFS
pin match, call the module’s logger (e.g., process_logger.warn or log::warn)
with a concise message and the error/result, then continue as before to preserve
behavior.
| /// Fetch raw bytes for a CID from the local Kubo node (`/api/v0/cat`). | ||
| pub async fn cat(ipfs_api: &str, cid: &str) -> Result<Vec<u8>> { | ||
| if ipfs_api.is_empty() { | ||
| return Err(anyhow::anyhow!("IPFS not configured")); | ||
| } | ||
| let url = format!("{}/api/v0/cat?arg={}", ipfs_api.trim_end_matches('/'), cid); | ||
| let resp = reqwest::Client::new().post(&url).send().await?; | ||
| if !resp.status().is_success() { | ||
| return Err(anyhow::anyhow!("ipfs cat {cid}: {}", resp.status())); | ||
| } | ||
| Ok(resp.bytes().await?.to_vec()) |
There was a problem hiding this comment.
Bound the Kubo /cat request.
This new helper creates a fresh reqwest::Client with no timeout, so a slow or stalled Kubo fetch can hang both get_encrypted_blob and the sync worker indefinitely. Please reuse a client with a bounded timeout, or set a per-request timeout here.
⏱️ Suggested fix
pub async fn cat(ipfs_api: &str, cid: &str) -> Result<Vec<u8>> {
if ipfs_api.is_empty() {
return Err(anyhow::anyhow!("IPFS not configured"));
}
let url = format!("{}/api/v0/cat?arg={}", ipfs_api.trim_end_matches('/'), cid);
- let resp = reqwest::Client::new().post(&url).send().await?;
+ let resp = reqwest::Client::new()
+ .post(&url)
+ .timeout(std::time::Duration::from_secs(30))
+ .send()
+ .await?;
if !resp.status().is_success() {
return Err(anyhow::anyhow!("ipfs cat {cid}: {}", resp.status()));
}
Ok(resp.bytes().await?.to_vec())
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/ipfs_pin.rs` around lines 75 - 85, The cat function
creates a reqwest::Client with no timeout (reqwest::Client::new()) which can
hang get_encrypted_blob and the sync worker; modify cat (or callers) to use a
client with a bounded timeout (e.g., build a Client via
reqwest::Client::builder().timeout(Duration::from_secs(...)) and reuse it) or
apply a per-request timeout (e.g., wrap the send/bytes await with
tokio::time::timeout) so the POST to {}/api/v0/cat?arg={} fails fast on
slow/stalled Kubo; update the cat signature or module to accept/reuse a Client
or enforce a timeout around resp = client.post(&url).send().await and
resp.bytes().await to avoid indefinite hangs.
| let withheld = fetch_withheld(client, &origin_url, owner_short, repo_name).await; | ||
| let mode = classify_mirror(withheld); | ||
|
|
||
| let result = if local_path.exists() { | ||
| fetch_repo(&local_path, &remote_url).await | ||
| fetch_repo(&local_path, &remote_url, mode).await | ||
| } else { | ||
| clone_repo(&remote_url, &local_path).await | ||
| clone_repo(&remote_url, &local_path, mode).await | ||
| }; |
There was a problem hiding this comment.
Don’t downgrade an unknown mirror state to Plain.
fetch_withheld collapses 404, 5xx, network, and parse failures into None, and this branch always turns that into MirrorMode::Plain. For an existing promisor mirror, that flows into fetch_repo(..., MirrorMode::Plain), which clears remote.origin.promisor and remote.origin.partialclonefilter before refetching. A transient withheld-paths outage on a still-withheld repo will therefore start failing syncs until the endpoint recovers. Only downgrade on Some(empty), and preserve the current promisor mode when the lookup is unknown.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/sync.rs` around lines 181 - 188, The code currently
treats a None result from fetch_withheld as MirrorMode::Plain which incorrectly
downgrades unknown/error states; instead, check the Option returned by
fetch_withheld and only set MirrorMode::Plain when you get Some(empty); if
fetch_withheld returns None preserve the repository's existing
promisor/partialclone state (do not call classify_mirror on None). Concretely,
change the branch around fetch_withheld/classify_mirror to: if
withheld.is_some() { let mode = classify_mirror(withheld); } else { read the
current repo promisor/partialclone setting from the local repo at local_path (or
otherwise keep the current MirrorMode) and use that mode for
fetch_repo/clone_repo }, so fetch_repo/clone_repo are invoked with the preserved
mode rather than unconditionally using MirrorMode::Plain.
| let owner_short = owner.split(':').next_back().unwrap_or(owner); | ||
| let slug = format!("{owner_short}/{name}"); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Client-side Arweave lookup normalization:"
rg -n -C2 'owner_short|slug = format!|graphql' crates/gl/src/clone.rs
echo
echo "Server-side encrypted-manifest tag construction:"
rg -n -C4 'encrypted-manifest|Schema|Repo|App-Name' crates/gitlawb-node/src/arweave.rs crates/gitlawb-node/src/api/repos.rsRepository: Gitlawb/node
Length of output: 20700
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Find EncryptedManifest and where `manifest.repo` is set:"
rg -n "struct EncryptedManifest|manifest\.repo|EncryptedManifest<'_>" crates/gitlawb-node/src/arweave.rs
echo
echo "Search for `repo:` field construction / slug derivation in node code:"
rg -n "Repo\s*[:=]|repo\s*:\s*|format!\(.*owner.*name|owner.*split\(':'\)|next_back\(\)" crates/gitlawb-node/src/arweave.rs crates/gitlawb-node/src -SRepository: Gitlawb/node
Length of output: 10904
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== crates/gitlawb-node/src/arweave.rs (top + EncryptedManifest + tag header) ==="
nl -ba crates/gitlawb-node/src/arweave.rs | sed -n '1,260p'
echo
echo "=== crates/gl/src/clone.rs (Arweave fallback slug + GraphQL variables) ==="
nl -ba crates/gl/src/clone.rs | sed -n '360,520p'Repository: Gitlawb/node
Length of output: 199
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "arweave.rs lines:"
wc -l crates/gitlawb-node/src/arweave.rs
echo
echo "crates/gitlawb-node/src/arweave.rs: EncryptedManifest + manifest_json payload + build_manifest_tags_header + any construction sites:"
cat -n crates/gitlawb-node/src/arweave.rs | sed -n '1,260p'
echo
echo "clone.rs: Arweave fallback slug + GraphQL body:"
wc -l crates/gl/src/clone.rs
echo
cat -n crates/gl/src/clone.rs | sed -n '380,480p'Repository: Gitlawb/node
Length of output: 15653
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Call sites for anchor_encrypted_manifest / EncryptedManifest construction ==="
rg -n "anchor_encrypted_manifest|EncryptedManifest\s*\{" crates/gitlawb-node/src
echo
echo "=== Inspect around likely repo_slug construction (owner_short + repo_name) in api/repos.rs ==="
rg -n "owner_short\s*=.*split\(':'\)\.next_back\(\)\.unwrap_or|repo_slug\s*=\s*format!" crates/gitlawb-node/src/api/repos.rs -n
sed -n '650,820p' crates/gitlawb-node/src/api/repos.rs | cat -nRepository: Gitlawb/node
Length of output: 9929
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== clone.rs: encrypted-blobs parsing ==="
rg -n "/encrypted-blobs|encrypted-blobs|recover_encrypted_blobs|serde_json::Value" crates/gl/src/clone.rs
echo
echo "=== clone.rs: recover_encrypted_blobs implementation ==="
# Print the function body area (best-effort) around its definition
rg -n "async fn recover_encrypted_blobs" -n crates/gl/src/clone.rs | head -n 5
# Use line ranges around likely area
sed -n '1,220p' crates/gl/src/clone.rs | head -n 220
echo
echo "=== clone.rs: fetch_withheld + schema strictness for withheld-paths ==="
rg -n "withheld-paths|fetch_withheld|recover_from_arweave" crates/gl/src/clone.rsRepository: Gitlawb/node
Length of output: 9912
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# encrypted-blobs + parsing behavior
echo "=== clone.rs: recover_encrypted_blobs + parse_tx_ids ==="
sed -n '220,420p' crates/gl/src/clone.rs
echo
echo "=== clone.rs: WithheldPathsResponse parsing (schema strictness / fail-open) ==="
sed -n '180,260p' crates/gl/src/clone.rs
echo
echo "=== clone.rs: tests relevant to strictness / arweave recovery ==="
rg -n "withheld-paths|encrypted-blobs|recover_from_arweave|unauthorized|happy|schema" crates/gl/src/clone.rsRepository: Gitlawb/node
Length of output: 11990
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '740,820p' crates/gl/src/clone.rsRepository: Gitlawb/node
Length of output: 3031
Fix Arweave manifest discovery pagination; encrypted-blobs parsing should be schema-strict
- Arweave
Reposlug normalization matches: the client uses{last-':' owner segment}/{name}, and the node anchorsEncryptedManifest.repowith the sameowner_did.split(':').next_back()short-form, taggingRepo:{manifest.repo}. - Arweave GraphQL discovery hard-codes
first:100; if a repo has >100 encrypted-manifest anchors under the sameRepotag, recovery can miss the newest per-blob manifests and leave still-live blobs unrecovered—needs pagination/iteration until coverage is complete. /encrypted-blobsparsing is fail-open on schema drift:recover_encrypted_blobsusesValueand defaultsblobsto[]viaunwrap_or_default(), unlike/withheld-pathswhich deserializes a typed struct and errors on missing/wrong fields; switch/encrypted-blobsto a strict typed response and fail on missing/wrong types.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gl/src/clone.rs` around lines 415 - 416, The slug normalization and
Arweave discovery/parsing need tightening: ensure owner short-form uses the same
split logic as the node by keeping owner_short =
owner.split(':').next_back().unwrap_or(owner) and forming slug =
format!("{owner_short}/{name}"); update the GraphQL discovery for
encrypted-manifest anchors in discover/recovery code to paginate instead of
hard-coding first:100 (iterate queries until no more edges or timestamps
covered) so recover_encrypted_blobs sees all anchors; change
recover_encrypted_blobs to deserialize the /encrypted-blobs response into a
strict typed struct (like the existing withheld-paths type) and fail on
missing/wrong fields instead of using Value with unwrap_or_default(), ensuring
schema-strict parsing and surfaced errors.
| let query = r#"query($repo:String!){transactions(tags:[{name:"App-Name",values:["gitlawb"]},{name:"Schema",values:["gitlawb/encrypted-manifest/v1"]},{name:"Repo",values:[$repo]}],first:100){edges{node{id}}}}"#; | ||
| let gql_body = serde_json::json!({ "query": query, "variables": { "repo": slug } }); | ||
| let resp = match client | ||
| .post(format!("{ag}/graphql")) | ||
| .json(&gql_body) | ||
| .send() | ||
| .await | ||
| { | ||
| Ok(r) if r.status().is_success() => r, | ||
| _ => return Ok(vec![]), | ||
| }; | ||
| let gql: serde_json::Value = match resp.json().await { | ||
| Ok(v) => v, | ||
| Err(_) => return Ok(vec![]), | ||
| }; | ||
| let tx_ids = parse_tx_ids(&gql); | ||
| if tx_ids.is_empty() { | ||
| return Ok(vec![]); | ||
| } | ||
|
|
||
| // 2. Fetch and parse each manifest body, then merge latest-wins per oid. | ||
| let mut manifests = Vec::new(); | ||
| for tx in tx_ids { | ||
| let m = match client.get(format!("{ag}/{tx}")).send().await { | ||
| Ok(r) if r.status().is_success() => r, | ||
| _ => continue, | ||
| }; | ||
| if let Ok(parsed) = m.json::<Manifest>().await { | ||
| manifests.push(parsed); | ||
| } | ||
| } | ||
| let oid_cid = merge_manifests(manifests); |
There was a problem hiding this comment.
Paginate manifest discovery instead of truncating at 100 transactions.
This fallback only inspects the first 100 manifest txs, then merges per-push manifests to find the latest CID per OID. Once a repo has more than 100 pushes, any private blob whose last seal is older than that window disappears from recovery even if it is still live at HEAD.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gl/src/clone.rs` around lines 428 - 459, The current GraphQL fetch
(the local `query` + `gql_body`, use of
`client.post(...).json(&gql_body).send().await`, and subsequent
`parse_tx_ids(&gql)`) only requests first:100 and stops, causing missing
manifests; change the logic to paginate: rewrite the GraphQL query to take
variables for "first" and "after" (cursor), then loop POSTing the query
(building `gql_body` with `{ "repo": slug, "first": batch_size, "after": cursor
}`), call `parse_tx_ids` (or adjust it to return edges + pageInfo) to collect tx
ids across pages, update `cursor` from pageInfo.nextCursor/hasNextPage until no
more pages, then proceed to fetch manifests and call
`merge_manifests(manifests)` as before; ensure the loop uses the same
`client.post(...).json(&gql_body).send().await` error handling as existing code
and retains the existing behavior when responses fail.
|
hello @beardthelion do you want to be a maintainer of this node repository? I would like to add you as one. let me know if you are interested. |
|
Thanks @kevincodex1, I'd be glad to. I've enjoyed working on this and I'm happy to take on more responsibility for it. Couple of things so I know what I'm signing up for: what's the expectation for the role, just merge access, or also triage/releases and general upkeep? And who else maintains it currently? Either way, count me in. |
|
please rebase and fix conflicts |
90dc8d4 to
67afb20
Compare
…adability
The origin no longer stores recipient DIDs. Migration v5 replaces the
encrypted_blobs.recipients column with an opaque, node-keyed recipients_tag
used only to detect a recipient-set change for re-seal. B1 discovery and fetch
are now gated by the same repo-readability check the git read path uses, not by
per-recipient matching; decryption is gated by the envelope crypto, so a
non-recipient who can read the repo sees a blob's {oid, cid} but cannot open it.
encrypt_and_pin keys the tag from the node seed and returns {oid, cid}; the
Arweave manifest tuple drops the now-unused recipient vec.
A DB compromise no longer reveals the reader set; recovering it would require
brute-forcing candidate DID sets against the keyed tag with the node key.
…rror
Address review on the at-rest blinding change:
- The encrypted-blobs/replicate listing returned {oid, cid} with no visibility
check, so a non-readable repo's blob index was reachable by an unauthenticated
caller who guessed {owner}/{repo}. Gate it by the same repo-readability check
discovery and fetch use. For the intended case (a public repo with withheld
subtrees) the public root keeps this open to peers; only fully non-readable
repos are withheld, which is the desired behavior.
- encrypt_and_pin treated a recipients_tag DB read error as a cache miss and
resealed, causing avoidable IPFS writes during a partial outage; skip and retry
on the next push instead.
- Correct the get_encrypted_blob doc comment to describe repo-readability access.
…nly pin set blob_paths walked only refs/heads/* and refs/tags/* and skipped silently on a failed git ls-tree, so a blob reachable only through another namespace, or a ref that failed to traverse, could fall out of the withheld set and ship in cleartext. Walk every ref and fail closed on traversal error. The pin enumerators (ipfs_pin, pinata) used git cat-file --batch-all-objects, which includes unreachable/dangling objects that have no path and cannot be classified for withholding. Switch them to git rev-list --objects --all so the pin set matches the reachable graph blob_paths evaluates.
67afb20 to
cd5cda4
Compare
Closes #42.
blob_pathsinvisibility_pack.rsbuilds the blob→path map that every withholding decision rests on. Two flaws let it produce an incomplete withheld set, which downstream code then treats as public, leaking withheld content in cleartext:Ref-namespace gap. It walked only
refs/heads/*andrefs/tags/*, but the serve path enumeratesgit rev-list --objects --all(every object reachable from any ref). A blob reachable only through another namespace (e.g.refs/notes/*) never entered the withheld set and shipped in cleartext. Fix: drop the heads/tags filter and walk every ref, still per-ref viagit ls-tree -rso all paths a blob lives at are seen (a blob is withheld only if denied at every path; if it is public at some allowed path it must still be sent). Note this is deliberately notrev-list --objects, which reports one representative path per object and would regress that correctness.Fail-open ref walk. A non-zero
git ls-treeexit was silently skipped, yielding a partial set. Fix: fail closed (bail!). All three callers already abort on error (serve propagates, push/encrypt-pin skip the block), so no path serves unfiltered.It also narrows the two pin enumerators (
ipfs_pin,pinata) fromgit cat-file --batch-all-objectstogit rev-list --objects --all. The old call included unreachable/dangling objects, which have no ref and no tree path, so visibility rules cannot classify them; a dangling copy of a secret blob would bypass withholding straight into a permanent IPFS pin. Reachable-only keeps the pin set aligned with whatblob_pathscan evaluate.Four tests added: a blob reachable only via a non-head/tag ref is still withheld; an untraversable ref makes the withheld computation return
Err; and a dangling blob is absent from each pin enumerator while a committed blob is present.Stacks on #40 (base=main), carrying the cumulative Phase-2/3 prereq diff since these files exist only in the unmerged stack.
Summary by CodeRabbit
gl clonecommand supporting partial clones with content filtering and automatic blob recovery